This chapter will provide a brief survey of some of the 
current issues in the literature on testing linguistic and 
communicative proficiency, and will narrow the field somewhat by 
focusing on the testing of reading comprehension as a case in 
point. (2) We will start by discussing several theoretical issues 
in language testing, will then consider several areas of concern 
regarding methods of testing reading comprehension, and will 
conclude with a look at strategies of test takers in dealing with 
reading comprehension tests. 

Theoretical Issues in Testing

1. Purpose of Testing

It has been demonstrated that tests can be used for 
administrative, instructional, or research purposes (Jacobs et 
al. 1981). In fact, the same test of reading comprehension could 
conceivably be used for twelve different purposes: five 
administrative purposes -- assessment, placement, exemption, 
certification, promotion; four instructional purposes -- 
diagnosis, evidence of progress, feedback to the respondent, 
evaluation of teaching or curriculum; and three research purposes 
-- evaluation, experimentation, knowledge about language learning 
and language use. 



(2) I owe my expertise in language testing in no small part to 
Robert Politzer, for it was he who encouraged me to become an 
evaluator of a bilingual education program, which in turn gave me 
field experience in psychometrics, which afforded me the 
credibility which led to offers for work which enabled me to get 
even more experience. 

Given the traditional ways of designing tests of reading 
comprehension, the average test is not intended to be used for 
more than several purposes, and the major split is often between 
proficiency tests intended for administrative purposes and 
achievement tests for assessment of instructional results. 

Current innovations in testing, however, would suggest that 
the same test could possibly merge these two different sets of 
purposes under certain circumstances, i.e., if assumptions of 
design and use are met. In other words, it is being suggested 
that tests used to differentiate people according to general 
level of ability and tests used for certifying the attainment of 
content be combined in one test (Henning 1985). The suggested 
means for achieving this merger is through item response theory 
(specifically, the Rasch model), wherein a latent "acquisition" 
continuum is inferred both for testing tasks and for the ability 
level of the respondents. In that both respondents' ability and 
item or task difficulty are positioned along the same latent 
continuum, it is thus considered possible to make inferences from 
examinee performance that are referenced to the performances of 
other individuals or to the standards imposed by other tasks. It 
is argued that by merging proficiency and achievement tests in 
this way, placement can be more in line with what is taught, 
passing from one level of instruction to the next can be 
contingent on actual learning, and the curriculum can be more 
sensitive to individual differences of students at every level. 

It is important to point out that this suggested merger is 
only possible if a number of assumptions about test design are 
met. It assumes that the Rasch model provides a "good fit" for 
the items. In testing language competence, this is problematic 
since the Rasch model requires the items to be constructed along 
a single dimension and yet there is usually a degree of 
multidimensionality in language tests since language competence 
is not a unitary skill, but rather involves different types of 
skills (see Woods and Baker 1985). Other assumptions that are 
disputed include the claims that the item bank will retain stable 
properties over a long period and that it is possible to 
gradually add items without retesting all those in the bank. 
Woods and Baker add that Rasch provides a "sample-free" estimate 
of item difficulty only if the Rasch model provides a perfect fit 
and reflects true item difficulties rather than just estimates. 
However, according to Woods and Baker, random variation in the 
respondents rules out the possibility of a perfect fit. Thus, 
whereas we need to be open to the possibility of new groupings of 
test purposes in accordance with advances in the field, we must 
proceed cautiously, weighing the pros and cons of each 
innovation. 
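
The idea of placing respondents and items on one latent continuum 
can be illustrated with a minimal sketch. The logistic form used 
below is the usual textbook statement of the Rasch model and is 
not spelled out in the sources cited here; the function names, the 
logit scale, and the sample item difficulties are assumptions made 
purely for illustration. 

```python
import math
from typing import Sequence


def rasch_probability(ability: float, difficulty: float) -> float:
    """Probability of a correct response under the Rasch model.

    Both arguments are positions on the same latent continuum
    (in logits); only their difference affects the result.
    """
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def expected_score(ability: float, difficulties: Sequence[float]) -> float:
    """Expected number of items answered correctly on a test."""
    return sum(rasch_probability(ability, d) for d in difficulties)


if __name__ == "__main__":
    # Hypothetical difficulties for a three-item reading test.
    items = [-1.0, 0.0, 1.5]
    for ability in (-1.0, 0.0, 1.0):
        print(ability, round(expected_score(ability, items), 2))
```

In this sketch a respondent whose ability equals an item's 
difficulty has an even chance of answering it correctly, which is 
the sense in which one scale can be read against other respondents 
or against tasks. The sketch says nothing about whether the model 
fits a given set of language items. 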

2. Test Validity 

The next issue we will consider is that of test validity. 
It is related to testing purpose in that a test can only be 
considered as valid or invalid with respect to some intended 
purpose. Although test validity is often discussed, the actual 
measure of validity is elusive. Part of the problem is that, as 
Morrow (1981) points out, there is no such thing as "absolute 
validity." Validity exists only in terms of specified criteria. 
So, if the criteria selected are the wrong ones (i.e., not 
interesting or not useful), then the validity is spurious. Thus, 
the situation may arise wherein a test with admirable qualities 
is invalid in that it is used for inappropriate purposes. For 
example, a test may be an adequate measure of general placement 
but be of limited utility in diagnosis of specific reading 
problems. 

Another part of the problem is that certain measures of 
validity lend themselves more easily to conventional means 
of investigation, while others do not (Underhill 1983). 
Concurrent and predictive validity(3) can be readily assessed 
empirically through correlating results on the test under study 
with scores on other tests considered to be valid in terms of 
specified criteria. Construct validity, content validity, and 
face validity(4), on the other hand, are referred to by Underhill 
as forms of "theoretical validation" in that test evaluators must 
rely on "intuition and introspection" for their assessment. 

(3) "Concurrent validity" relates to the extent of correlation 
between the test results and those on another test believed to 
measure the same function, both taken at the same time. 
"Predictive validity" deals with the extent to which results on 
the test enable prediction of performance on another test in the 
future. 

(4) "Construct validity" concerns the extent to which the 
items/tasks match the theory behind them. "Content validity" 
considers whether the items/tasks in the test match what the test 
as a whole purports to assess. "Face validity" deals with the 
issue of whether the test looks like a reasonable test. 
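
The empirical check described for concurrent and predictive 
validity, correlating scores on the test under study with scores 
on a criterion test, can be sketched as follows. The text does not 
name a particular coefficient; the Pearson product-moment 
correlation is assumed here, and the function name and score 
lists are hypothetical. 

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between two sets of paired scores."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equally long lists of at least 2 scores")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariation = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("scores must vary for a correlation to be defined")
    return covariation / (spread_x * spread_y)


if __name__ == "__main__":
    # Hypothetical scores of six respondents on two reading tests.
    new_test = [12, 15, 9, 20, 17, 11]
    criterion_test = [30, 34, 25, 41, 39, 27]
    print(round(pearson_r(new_test, criterion_test), 2))
```

A coefficient computed this way describes only how far the two 
sets of scores rank respondents alike; it does not by itself show 
that either test suits a given purpose. 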

In recent years the conventional means of assessing 
validity have been questioned and the more unconventional have 
been given more credence. It has been pointed out, for example, 
that assessing both concurrent and predictive validity is not 
so simple. The argument is made that a high correlation between 
two tests does not indicate which is preferable, or if either is 
any good for the given purpose, or whether one can be substituted 
for the other. Rather, it is suggested that trait-method 
interaction may be taking place -- i.e., that in a given language 
use situation, individual respondents will react differently (Low 
1985). In addition, it has been suggested that the term "face 
validity" is unfortunate because of its derogatory overtones. 
Low (1985) would offer the term "perceived validity" instead. As 
it relates to respondents, then, this form of validation would 
refer to their perceptions as to: 1) any bias in test content 
(i.e., whether the content seems to favor a respondent with 
certain background knowledge or expertise), 2) the nature of the 
task that they are being requested to perform, and 3) the nature 
of their actual performance on the test as a whole and on any 
particular subtests (test-taking strategies employed). 
This concern for giving careful consideration to perceived 
validity comes at a time when mentalistic measures are being 
called upon to gather verbal reports from respondents regarding 
the strategies that they are using during the process of taking 
tests (Cohen 1984, Cohen 1980, Dollerup et al. 1982). It is 
being demonstrated that the use of mentalistic measures can yield 
empirical data that provide considerable information concerning 
how respondents perceive tests and how they actually deal with 
them in testing situations. Having looked at the issues of 
clarifying purposes for testing and of considering respondents' 
perceptions of these tests, let us now look at key concerns in 
determining or evaluating methods of testing reading 
comprehension. 

Methods of Testing Reading Comprehension 

Reading comprehension items or procedures require of 
learners that they use a certain type or types of reading, 
comprehend at a certain level or combination of levels of 
meaning, enlist a certain comprehension skill or skills, and do 
all of this within the framework of a certain testing method or 
methods. In this section, we will look at some of the choices 
available to the test constructor and considerations of concern 
to the test user. 

1. Type of Reading 

Items and procedures can be written so that they implicitly 
or explicitly call for a given type of reading. For example, a 
respondent can be given a lengthy passage to read in a limited 
time frame such that the only way to handle it successfully is to 
skim(5) or to scan(6), depending on the task. A distinction is 
also made between scanning and "search reading," where in the 
latter case the respondent is scanning without being sure about 
the form that the information will take (i.e., whether it is a 
word, phrase, sentence, passage, or whatever) (Pugh 1978). A 
respondent could also be given a passage to read receptively. (7) 
Yet another approach is to have respondents read responsively, 
such that the written material acts as a prompt to them to 
reflect on some point or other and then possibly to respond in 
writing. Testing formats in which questions are interspersed 
within running text may especially cater to such an approach, if 
the questions stimulate an active dialog between the text and the 
reader. 

(5) Overall rapid inspection with periods of close inspection. 

(6) Locating a specific symbol or group of symbols -- e.g., a 
date, a name of a person or place, a sum of money, etc. 

(7) Discovering accurately what the author seeks to convey. 

The type of reading task is raised here because it would 
appear to be neglected at times in the process of test 
construction. In other words, reading items and tasks are 
sometimes constructed without careful consideration as to how the 
respondent is to read them. It may even be of benefit for the 
test constructor to indicate explicitly to the respondent the 
type of reading expected. For example, a certain item could be 
introduced by the following: 

    Read the following text through rapidly (i.e., skim it) in 
    order to get the main points. There will not be time to 
    read the text intensively. When you have completed this 
    reading, answer the questions provided -- without looking 
    back at the text. You will have ten minutes for the 
    exercise. 

Another type of reading constituting a test of its own is 
oral reading. Various oral reading functions could be tested 
such as the giving of a talk from a scripted text, the announcing 
of public information (as if at a train station, airport, etc.), 
the reading aloud of the contents of a pamphlet (giving, for 
example, the operating instructions for some appliance), or the 
reading of a children's story. Given that the reading of text as 
oral recitation is not intended to be the same sort of behavior 
as silent reading (involving the skipping of words and phrases, 
regressions, and pauses), oral reading needs to be assessed by 
its own set of criteria, not by those used for assessing silent 
reading. For example, a scripted talk could be assessed in terms 
of smoothness of delivery, appropriateness of intonation, and so 
forth. The successful reading of a pamphlet could be based on 
whether stress is placed on those items of crucial importance in 
having the appliance operate successfully. 

A possible misuse of oral reading has been as a means for 
tapping silent reading through assessing miscues -- i.e., the 
addition, subtraction, substitution, or transposition of material 
while reading aloud (Leu 1982). Effective reading comprehension 
almost invariably means silent reading. The reader of a 
scientific paper, for example, may well stop at numerous points 
and go back to check the precise wording of earlier parts of the 
article, or periodically jump forward to read the footnotes, the 
references, or pre-read the conclusion (Carre 1981). In short, 
oral reading as recitation is not the same process as silent 
reading. 

2. Level of Meaning 

A test item or procedure can tap comprehension at one of 
four levels of meaning or at several levels simultaneously: 
grammatical meaning, propositional meaning, discoursal meaning, 
and pragmatic meaning (adapted from Nuttall 1982). Note, 
however, that these categories are presented as a rough rule of 
thumb, rather than as a hierarchy of discrete levels. 
Grammatical meaning deals with the meanings that words and 
morphemes have on their own. Propositional meaning refers to the 
meaning that a clause or sentence can have on its own, i.e., the 
information that the clause or sentence transmits. This meaning 
is also referred to as its "informational value." Discoursal 
meaning relates to the meaning a sentence can have only when in 
context. This meaning is also referred to as its "functional 
value." Pragmatic meaning concerns the meaning that a sentence 
has only as part of the interaction between writer and reader. 
This is the meaning that reflects the writer's feelings, 
attitudes, and the intended effect of the utterance upon the 
reader. 

The level of meaning that has perhaps gotten the most 
attention in the literature in recent years is the discoursal 
one, especially the perception of rhetorical functions conveyed 
by text. For example, an item may overtly or covertly require a 
respondent to identify where and how something is being defined, 
classified, exemplified, or contrasted with something else. 
Often such "discourse functions" are signaled by connectors or 
"discourse markers." Nonetheless, uninformed or unalert readers 
may miss these signals -- words or phrases such as "unless," 
"however," "thus," "whereas," and the like. Research has shown 
that such markers need not be subtle to cause reading problems. 
Simple markers of sequential points ("first," "also," and 
"finally") may be missed by a reader just as more subtle 
markers may (see Cohen et al. 1979). 

3. Comprehension Skill 

Not only must a test constructor and user be aware of levels 
of comprehension, but also of individual skills tested by reading 
comprehension questions at one or more such levels of meaning. 
There are numerous taxonomies of such skills. Alderson (1986) 
offers one which reflects a compilation of others, and includes: 
(1) the ability to recognize words and phrases of similar and 
opposing meaning, (2) the identifying or locating of information, 
(3) the discriminating of elements or features within context; 
the analysis of elements within a structure and of the 
relationship among them -- e.g., causal, sequential, 
chronological, hierarchical, (4) the interpreting of complex 
ideas, actions, events, relationships, (5) inferencing -- the 
deriving of conclusions and predicting the continuation, (6) 
synthesis, and (7) evaluation. We note that this taxonomy omits 
the reader-writer relationship -- e.g., the author's distance 
from the text and the level of participation in the text that the 
author requires of the reader. With this taxonomy, as with 
others, the boundaries between skills are assumed to be discrete 
when, in reality, they may not be. 

It is noteworthy that taxonomies of comprehension skills do 
not necessarily imply that the reading of texts requiring the use 
of so-called "higher-order" skills constitutes a more difficult 
task. In other words, interpreting complex relationships may not 
be any more difficult, and is perhaps easier, than recognizing 
that two words are antonyms in a given context. Alderson (1986), 
for example, reported on a study in which both weaker and more 
proficient Bombay university students had as much difficulty with 
lower-order questions as they had with higher-order ones. One 
explanation given was that whereas the lower-order questions 
measured language skills, the higher-order ones measured 
cognitive skills which the lower-proficiency students had no 
problem with. Another explanation was that the lower- and 
higher-order distinction was faulty. Apparently ten expert 
judges at Lancaster disagreed on 27 out of the 40 reading items 
as to what each of them measured. 

4. Testing Methods 






Besides considering the type of reading to be performed, the 
desired levels of comprehension, and the comprehension skills to 
be tapped, the test constructor and user need to give careful 
thought to the testing method. The challenge is to maximize the 
measurement of the trait -- i.e., the respondent's ability -- 
while minimizing the reactive effects of the method. In order to 
do this, it is useful to be informed as to the options for testing 
with each method and what these options yield. We will look at 
three areas of concern regarding testing method -- the language 
of response, the cloze and the C-test, and the design of 
genuinely communicative reading comprehension tests. 

a. The Language of Response 

In foreign language tests, item responses have usually been 
in the foreign language, except in translation tasks. In the 
case of open-ended answers, Laufer (1983) offered three reasons 
why first-language responses might be preferable. She noted that 
when responses are in the foreign language, it is possible to 
copy answers from the text, writing may be of poor quality, and 
the respondent can be terse in order to play it safe, thus 
providing not quite enough information to judge whether the 
response is correct. 

Researchers have recently been exploring the effects of 
mixed language formats -- elicitation in the foreign language, 
response in the first language. Shohamy (198^), for example, found 
that multiple-choice and open-ended responses in the first 
language were easier to answer and were probably processed 
differently than in the foreign language. Although she felt that 
having multiple-choice alternatives in the first language may 
give clues to the meaning of the text, she saw it as eliminating 
the use of tricky look-alike items and unknown distractors. She 
found that with her sample of Israeli twelfth-grade students of 
English as a foreign language, the language used for responses 
affected lower proficiency students more. She concluded that in 
criterion-referenced testing situations, where the purpose was to 
have every respondent perform at a maximal level, responses to 
foreign-language items should be in the first language. 

In another study, Zupnik (1985b) had twenty Hebrew-speaking 
intermediate EFL students (in their first year at the university) 
perform two tasks on an English text. In the first task, the 
students were requested to read the text and were asked five 
questions in English, two involving definitions, the other three 
involving a reason, a relationship, and a process respectively. 
In this task they were to indicate the precise line(s) in the 
English text that provided an answer to the question. These 
responses were collected and then the respondents were asked the 
same questions again, but in the second task they were to provide 
open-ended answers for the questions in Hebrew, first in rough 
draft, then in a revised version. Finding the relevant line of 
text in English was intended to reflect those types of questions 
that can be answered by quoting from the text, thus encouraging 
superficial reading. The first-language responses were expected 
to demand a deeper comprehension of the text. 

The results showed first-language responses to reflect a 
lower level of comprehension than the foreign-language responses 
(42% average correct on the Hebrew version vs. 59% on the English 
version). Also, although the correlation between performance on 
the two forms was significant (p<.05), it was low (r=.45). The 
researcher concluded that the two tests were in part testing 
different things. She pointed out that in reading a 
foreign-language text, it is possible to recognize that A causes 
B without understanding what B means. She noted that definitions 
were particularly easy to identify superficially and harder to 
explain in the first language. The item discrimination results 
indicated that the better respondents did better both on 
"locating abilities" (e.g., skimming and scanning), as called for 
in the English-language responses, and on reading in depth, as 
called for in the Hebrew-language responses. The better 
respondents were also more likely to paraphrase the relevant 
material from the text when responding in their first language 
rather than translating word-for-word (85% of responses from the 
better students vs. 57% of responses from the weaker students). 

b. The Cloze and the C-Test 

The origins of the cloze test date back farther than many 
would think -- to 1897, in fact. At that time, Ebbinghaus 
proposed a series of tests that had one- or two-word deletions, 
rational deletion, and partial deletion from the beginning or end 
of words (Ebbinghaus 1897). There is a controversy concerning the 
cloze test as to whether filling in cloze items is just a 
matter of perceiving local redundancy or rather involves an 
awareness of the flow of discourse across sentences and 
paragraphs, as Oller (1979, ch. 12) maintains. Whereas recent 
research would suggest that traditional fixed-word deletion is 
more of a micro-level completion test (a measure of word- and 
sentence-level reading ability) than a macro-level measure of 
skill at understanding connected discourse (Alderson 1983, 
Klein-Braley 1981), Chavez-Oller et al. (1985) have recently come 
out with yet another claim that cloze is sensitive to constraints 
beyond 5-11 words on either side of a blank, based on a 
reanalysis of earlier data. 

As an alternative to the fixed-word deletion, researchers 
have turned to the rational deletion cloze, whereby words are 
deleted according to predetermined, primarily linguistic criteria 
-- often stressing the area considered to be underrepresented, 
namely, macro-level discourse links (Levenston et al. 1984). 
Research by Bachman (1985) with EFL university students found 
that the rational deletion approach sampled much more across 
sentence boundaries and somewhat more across clause boundaries 
within the same sentence than did the fixed-ratio cloze. He 
concluded that the rational deletion cloze was a better measure 
of the reading of connected discourse, although he questioned its 
construct validity. Bachman found that while the rational 
deletion procedure affords the test developer a better means for 
making judgements regarding the content validity of such tests, 
the question remains as to whether such tests "in fact measure 
the components of language proficiency hypothesized by the 
deletion criteria" (Bachman 1985:556) -- i.e., the flow of 
discourse across sentences and paragraphs within a text. Markham 
(1985), for example, would contend that even the rational 
deletion cloze does not measure comprehension of connected 
discourse. He gave 84 English-speaking university students of 
German an original and a scrambled version of a rational deletion 
cloze and found that neither was testing for global reading 
ability. Thus, the controversy continues. 
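
The difference between the two deletion procedures can be made 
concrete with a minimal sketch. Nothing below reproduces the 
procedures of the studies cited; the function names, the default 
deletion rate, the starting point, and the list of target words 
are all assumptions chosen for illustration. 

```python
import string
from typing import Iterable, List, Tuple

BLANK = "_____"


def fixed_ratio_cloze(text: str, every_nth: int = 7,
                      start: int = 2) -> Tuple[str, List[str]]:
    """Delete every nth word, regardless of what the word is."""
    words = text.split()
    answers = []
    for index in range(start, len(words), every_nth):
        answers.append(words[index])
        words[index] = BLANK  # attached punctuation is deleted too
    return " ".join(words), answers


def rational_cloze(text: str,
                   targets: Iterable[str]) -> Tuple[str, List[str]]:
    """Delete only words chosen in advance, e.g. discourse markers."""
    wanted = {target.lower() for target in targets}
    words = text.split()
    answers = []
    for index, word in enumerate(words):
        core = word.strip(string.punctuation)
        if core.lower() in wanted:
            answers.append(core)
            words[index] = word.replace(core, BLANK)  # keep punctuation
    return " ".join(words), answers


if __name__ == "__main__":
    sample = "It rained. However, we went out; thus we got wet."
    print(fixed_ratio_cloze(sample, every_nth=4, start=1))
    print(rational_cloze(sample, ["however", "thus"]))
```

In the first function the test constructor decides only the 
deletion rate and starting point; in the second, the constructor 
decides which words are removed, which is what allows deletions to 
be aimed at links across clauses and sentences. 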

A suggested alternative to the cloze test, namely the 
C-test, has been proposed by Klein-Braley and Raatz (Raatz & 
Klein-Braley 1982, Klein-Braley & Raatz 1984, Klein-Braley 1985, 
Raatz 1985). In this procedure, the second half of every other 
word is deleted, leaving the first and the last sentence of the 
passage intact. A given C-test consists of a number of short 
passages (maximum 100 words) on a variety of topics. This 
alternative eliminates certain problems associated with cloze, 
such as choice of deletion rate and starting point, 
representational sampling of different language elements in the 
passage, and the inadvertent assessment of written production as 
well as reading. With the C-test, being given a clue (half the 
word) serves as a stimulus for respondents to find the other 
half. The following is one passage within a C-test (from Raatz 
1985): 

    Pollution is one of the big problems in the world 
    today. Towns a____ cities a____ growing, indu____ 
    is gro____, and t____ population o____ the 
    wo____ is gro____. Almost every____ causes 
    poll____ in so____ way o____ another. T____ air 
    i____ filled wi____ fumes fr____ factories a____ 
    vehicles, a____ there i____ noise fr____ airplanes 
    a____ machines. Riv____, lakes a____ seas a____ 
    polluted b____ factories and by sewage from our homes. 
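
The C-test deletion rule described above can be sketched in a few 
lines. This is an illustrative approximation, not the procedure of 
Raatz or Klein-Braley: the function names are hypothetical, and 
the handling of odd-length words (the shorter half is kept), 
one-letter words (left intact), and sentence splitting are 
assumptions. 

```python
import re


def mutilate(word: str) -> str:
    """Keep the first half of a word and blank out the second half."""
    match = re.match(r"([A-Za-z]+)(.*)", word)
    if not match or len(match.group(1)) < 2:
        return word  # one-letter words and non-words are left alone
    letters, trailing = match.groups()
    keep = len(letters) // 2  # assumption: round down for odd lengths
    return letters[:keep] + "_" * (len(letters) - keep) + trailing


def make_c_test(passage: str) -> str:
    """Mutilate every other word between the first and last sentence."""
    sentences = re.split(r"(?<=[.!?])\s+", passage.strip())
    if len(sentences) < 3:
        return passage  # nothing lies between first and last sentence
    count = 0
    middle = []
    for sentence in sentences[1:-1]:
        words = []
        for word in sentence.split():
            count += 1
            # Counting runs on across sentences; every second word is hit.
            words.append(mutilate(word) if count % 2 == 0 else word)
        middle.append(" ".join(words))
    return " ".join([sentences[0], *middle, sentences[-1]])


if __name__ == "__main__":
    text = ("Pollution is a problem. Towns and cities are growing. "
            "We must act.")
    print(make_c_test(text))
```

Because the starting point and deletion rate are fixed by the 
rule itself, the only choices left to the test constructor in 
this sketch are the passages. 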

At present it would appear that the C-test may well be a 
more reliable and valid means of assessing what the cloze test 
assesses, but as suggested above, it is still not clear to what 
extent the C-test tests more than micro-level processing. 
Because half the word is given, students who do not understand 
the macro-context can still mobilize their vocabulary skills 
adequately to fill in the appropriate discourse connector without 
indulging in higher-level processing. This was the finding from 
research using Hebrew C-tests (Cohen et al. 198^). (8) Extensive 
research on what processing of C-test items actually entails is 
currently underway using data from protocols of German 
speakers' verbal reports while taking French and Spanish C-tests, 
and more information will be available in the near future 
(Grotjahn 1986). 



(8) Lo (Personal Communication) suggests that the C-test is a 
different test for VO languages as opposed to OV languages 
because verbal affixes and morphology are in different positions. 
For example, a Gaelic C-test would only give the first letter of 
a mutation and frequently, the letter given would be for the 
affix not for the noun stem. 

c. Communicative Tests of Reading Comprehension 

For years attention has been paid to so-called 
"communicative tests" -- usually implying tests dealing with 
speaking. More recently, efforts have been made to design truly 
communicative tests of other language skills as well, such as 
reading comprehension. Canale (1984) points out that a good test 
is not just one which is valid, reliable, and practical in terms 
of test administration and scoring, but rather one that is 
acceptable -- i.e., accepted as fair, important, and interesting 
by test takers and test users. (9) Also, a good test has feedback 
potential -- rewarding both test takers and test users with 
clear, rich, relevant, and generalizable information. Canale 
suggests that acceptability and feedback potential have often 
been accorded low priority, thus explaining the curious 
phenomenon of multiple-choice tests claiming to assess oral 
interaction skills. 

(9) This position is an endorsement of the need to take into 
account "perceived validity" (Low 1985), as discussed above. 

Some recent approaches to communicative testing were in part 
an outgrowth of a theoretical framework proposed by Canale and 
Swain (1980), which offered a basis for communicative testing. 
This framework defined four types of competence that need to be 
considered in assessing communicative ability: grammatical, 
discoursal, sociocultural, and strategic. (10) Both Swain and 
Canale undertook to construct communicative tests consistent with 
their framework. The particular variety of communicative test 
that they dealt with has been referred to as a "storyline" test, 
a test with a line of development. In such a test, there is a 
common theme running throughout in order to assess context 
effects. The basis for such an approach is that the respondents 
learn as they read on, that they double back and check previous 
content, and that the ability to use language in conversation or 
writing depends in large measure on the skill of picking up 
information from past discussion and using it in formulating new 
strategies (Low, in press). 

Swain (1984), for example, developed a storyline test of 
French as a foreign language for high-school French immersion 
students. The test consisted of six tasks around a common theme, 
"finding summer employment." There were four writing tasks (a 
letter, a note, a composition, and a technical exercise) and two 
speaking tasks (a group discussion and a job interview). The 
test was designed so that the topic would be motivating to the 
students and so that there would be enough new information 
provided in order to give the tasks credibility. Swain provided 
the respondents with sufficient time, suggestions as to how to do 
the test, and clear knowledge about what was being tested. There 
was access to dictionaries and other reference material, and 
opportunity to review and revise their work. Swain's main 
concern was to "bias for best" in the construction of the test -- 
to make every effort to support the respondents in doing their 
best on the test. (11) 

(10) "Grammatical competence" refers to mastery of the 
features and rules of the language, "discoursal competence" to 
cohesion (local links within the text) and coherence 
(interpretation and use of connected utterances in a meaningful 
whole), "sociocultural competence" to sociocultural rules of 
appropriateness (status, purpose, norms of interaction), and 
"strategic competence" to ways of compensating for imperfect 
knowledge of rules (such as through paraphrase, shifts in 
register, etc.). 

Canale also provided a design for a communicative storyline 
test for administration to university-level learners of 
English as a second language in Ontario (Canale 1985). The 
example provided had a suggested theme, "a day in the life of a 
student." It consisted of four phases: a warm-up, a level check, 
a probe, and a wind-up. The warm-up was intended to put test 
takers at ease and to familiarize them with the language and the 
interviewer. The given example was that of "choosing one's 
courses," intended to afford the respondents an opportunity to 
decide which form of the test they wanted to try, an easier or a 
more difficult one. The level check identified the proficiency 
level at which the test taker performs best. The example 
provided dealt with "applying for a job or for aid," and 
consisted of short-answer responses. 



(11) [...] cases of bias can be viewed as a good thing, as 
intentional bias. The aim would be to set up tasks that test 
takers will be motivated to participate in, such as those that 
approximate real-life situations (Spolsky 1985). 
The probe was intended to challenge test takers with tasks 
just beyond their identified level in order to verify maximum 
proficiency and to show the test takers tasks which were still 
beyond their ability. In this subtest, respondents were asked to 
select a topic for a course report or take-home exam within their 
own discipline area. The wind-up was aimed at the test takers' 
best performance level in order to have them end with a sense of 
accomplishment. Test takers who took the same discipline-specific 
subtests were asked to engage in a semi-directed 
conversation on two themes: what each respondent proposed in the 
just-completed writing task and what they thought of the testing 
experience. 

Canale (1985) views communicative tests such as that 
described above as "proficiency-oriented achievement tests," 
which is consistent with Henning's (1985) suggested "marriage" 
between proficiency and achievement testing mentioned above. 
Canale offers five reasons for taking this view: 

(1) Such tests put to use what is learned. There is a 
transfer from controlled training to real performance. 

(2) There is a focus on the message, the function, and the 
form, not just on the form. 

(3) There is group collaboration as well as individual 
work, not just the latter. 

(4) The respondents are called upon to use their 
resourcefulness in resolving authentic problems in language use 
as opposed to accuracy in resolving contrived problems at the 
linguistic level. 
(5) The testing itself is more like learning, and the 
learners are more involved in the assessment. 

Communicative storyline tests have also received criticism 
for various reasons (Jones 1984; Liskin-Gasparro 1984; Low, in 
press). The following are some of the reservations made about 
such types of tests: 

(1) In order to approximate real life more, it is necessary 
to move away from mass administration and scoring, which is less 
practical and less objective. Tests that are acceptable (fair, 
important, and interesting) and give feedback are usually small- 
scale, classroom tests. 

(2) With a thematic organization, there is less efficiency 
because learners need to produce more text or respond to fewer 
items. 

(3) Such a test limits the variety of language material and 
thus leads to content bias expressly because the focus of the 
test is narrow. 

(4) There is the possibility of contamination, i.e., that a 
question relating to the first part of the test will be 
unintentionally answered in a later section. The fact that 
learners can use information from earlier parts of the test in 
answering subsequent questions lowers the test's reliability. 

(5) It is difficult to design such tests because of the 
need to have genuine links between sections without having them 
too interdependent. 
(6) There is a potential shock effect if respondents have 
not been tested by this approach before. 

It would appear that such criticisms need to be taken into 
account when considering the use of communicative tests. There 
appear to be clear advantages to pursuing such testing 
approaches, accepting their limitations. A Hebrew University 
seminar paper (Brill 1986), for example, had thirty-two ninth-grade 
Hebrew speakers complete a communicative storyline test, 
including five tasks dealing with membership in a youth 
group. (12) The students were then asked to compare their 
experience on this test and on the traditional multiple-choice 
one they had taken previously. They almost unanimously endorsed 
the communicative test as preferable because it was more 
creative, allowed them to express their opinions, was more 
interesting, taught them how to make contact with others, and 
investigated communication skills besides reading comprehension. 
For these reasons, they felt that it provided a truer measure of 
their competence than did the traditional test. 

d. Computerized Adaptive Testing (CAT) 



(12) The tasks included: writing a letter as a response to a 
friend interested in a youth movement the respondent belongs to, 
presenting questions to the group leader to get more information 
on the movement, preparing an announcement about the movement to 
post on bulletin boards, writing out a telephone request for 
information on how a local foundation could aid the movement, and 
writing out a telephone response to an invitation by a political 
group to join a demonstration of theirs. 
Computerized adaptive testing (CAT) of reading comprehension 
implies an approach to testing whereby the selection and sequence 
of items depend on the pattern of success and failure 
experienced by the respondent. Most commonly, if the respondent 
succeeds on a given item, one of greater difficulty is presented, 
and if the respondent experiences failure, then an easier item is 
presented. The testing continues until sufficient information 
has been gathered to assess the particular respondent's ability. 
At present, such tests are mostly limited to objective formats, 
such as multiple-choice. Based on item response theory (13), CAT 
is known to be more efficient and more accurate than conventional 
fixed-length tests employing multiple-choice items (Tung 1986). 
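
The selection rule just described can be illustrated with a minimal sketch in Python. It is offered only as an illustration under stated assumptions, not as the procedure of any actual CAT system: the names Item, run_adaptive_test, and max_reversals are hypothetical, difficulty is reduced to a simple rank, and "sufficient information" is approximated by stopping after a fixed number of reversals between success and failure, whereas operational systems would rely on item response theory estimates.

```python
from dataclasses import dataclass


@dataclass
class Item:
    label: str
    difficulty: int  # hypothetical difficulty rank; 1 is the easiest


def run_adaptive_test(items, answers_correctly, start_level, max_reversals=4):
    """Present a harder item after each success and an easier one after each failure."""
    # Group the item bank by difficulty level (new lists, so the bank itself is untouched).
    by_level = {}
    for item in items:
        by_level.setdefault(item.difficulty, []).append(item)
    lowest, highest = min(by_level), max(by_level)

    level = start_level
    history = []
    reversals = 0        # number of switches between success and failure
    last_correct = None
    while reversals < max_reversals:
        pool = by_level.get(level)
        if not pool:
            break  # no unused item remains at this level
        item = pool.pop(0)
        correct = answers_correctly(item)
        history.append((item.label, item.difficulty, correct))
        if last_correct is not None and correct != last_correct:
            reversals += 1
        last_correct = correct
        # Move one level up after success, one level down after failure.
        step = 1 if correct else -1
        level = min(highest, max(lowest, level + step))
    return history


# Illustrative use: an invented bank of five levels and a simulated
# respondent who succeeds on items up to level 3.
bank = [Item(f"L{d}-{n}", d) for d in range(1, 6) for n in range(4)]
history = run_adaptive_test(bank, lambda item: item.difficulty <= 3, start_level=1)
print([difficulty for _, difficulty, _ in history])
```

In this sketch the sequence of difficulty levels climbs until the respondent first fails and then oscillates around the level at which success gives way to failure, which is the region where further items are most informative.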

Among the advantages of CAT are the following: individual 
testing time may be reduced, frustration and fatigue are 
minimized, boredom is reduced, test scores and diagnostic 
feedback may be provided immediately, test security may be 
enhanced (since it is unlikely that two respondents would receive 
the same items in the same sequence) , record-keeping functions 
are improved, and information is readily available for research 
purposes (Henning, in press). The main disadvantage is that 
given its present item-response-theory basis, CAT requires that 



(13) Item response theory (also referred to as "latent trait 
measurement") refers primarily to analytical procedures for 
quantifying the probability of individual item and person 
response patterns given the overall pattern of responses in a set 
of test data (Henning 1984). Reference was also made to item 
response theory above under "Theoretical Issues in Testing." 
the construct to be measured be unidimensional, i.e., be 
assumed to involve only one major factor or underlying trait. It 
is suggested that such an assumption threatens to trivialize and 
compromise the existing theories of reading comprehension, which 
include multiple dimensions, such as world knowledge, language 
and cultural background, type of text, reading styles, and so 
forth, and fails to take into consideration various subcomponents 
of reading, along with the influences of instruction (Canale 
1986). 
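
What the unidimensionality assumption amounts to can be seen in a brief sketch. The one-parameter logistic (Rasch) model is used here purely as an assumed example of an item response model; the chapter does not state which model the cited CAT work employed, and the function names and the grid-search estimate are illustrative choices only. The point is that each respondent is represented by a single number, theta, however many components reading may in fact involve.

```python
import math


def rasch_probability(theta, b):
    """Probability of a correct response for ability theta and item difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def log_likelihood(theta, difficulties, responses):
    """Log-likelihood of a response pattern given one ability value."""
    total = 0.0
    for b, correct in zip(difficulties, responses):
        p = rasch_probability(theta, b)
        total += math.log(p if correct else 1.0 - p)
    return total


def estimate_ability(difficulties, responses, lo=-4.0, hi=4.0, steps=801):
    """Grid-search maximum-likelihood estimate of the single ability value."""
    grid = [lo + i * (hi - lo) / (steps - 1) for i in range(steps)]
    return max(grid, key=lambda t: log_likelihood(t, difficulties, responses))


# Illustrative use with invented difficulties: two successes on the easier
# items and one failure on the hardest yield an estimate above zero.
print(estimate_ability([-1.0, 0.0, 1.0], [True, True, False]))
```

Nothing in such a model distinguishes a respondent who fails an item for lack of world knowledge from one who fails it for lack of vocabulary, which is the kind of reservation raised above.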

The line of development that Canale (1986) would propose for 
CAT is that it move from simply mechanizing existing product- 
oriented reading comprehension item types to the inclusion of 
more process-oriented, interactive tasks that can be integrated 
into broad and thematically coherent language use/learning 
activities, such as "intelligent tutoring systems." (14) 

Test-Taking Strategies 

The strategies that respondents use in taking tests have 
implications both for the issue of test validity and "bias for 
best." Tests that are relied upon to indicate the comprehension 
level of readers may produce misleading results because of 
numerous techniques that readers have developed for obtaining 
correct answers on such tests without fully or even partially 



(14) In intelligent tutoring systems, the computer diagnoses 
the students' strategies and their relationship to expert 
strategies, and then generates instruction based on this 
comparison. 
understanding the text. As Fransson (1984) puts it, respondents 
may not proceed via the text but rather around it. In effect, 
then, there are presumptions held by test constructors and 
administrators as to what is being tested and there are the 
actual processes that test takers go through to produce answers 
to questions and tasks. The two may not necessarily be one and 
the same. It may also be that the strategies the respondents are 
using are detrimental to their overall performance, or at least 
not as helpful as others they could be using. 

Mentalistic measures using verbal report have helped 
determine how respondents actually take reading comprehension 
tests as opposed to what they may be expected to be doing (Cohen 
1984). Studies calling on respondents to provide immediate or 
delayed retrospection as to their test-taking strategies 
regarding reading passages with multiple-choice items have, for 
example, yielded the following results: 

(1) Whereas the instructions ask students to read the 
passage before answering the questions, students have reported 
either reading the questions first or reading just part of the 
article and then looking for the corresponding questions. 

(2) Whereas advised to read all alternatives before 
choosing one, students stop reading the alternatives as soon as 
they have found one that they decide is correct. 

(3) Students use a strategy of matching material from the 
passage with material in the item stem and in the alternatives, 
and prefer this surface-structure reading of the test items to 
one that calls for more in-depth reading and inferencing. 
(4) Students rely on their prior knowledge of the topic and 
on their general vocabulary. 

Recent Hebrew University student seminar papers have 
provided innovations in two areas of investigation regarding 
test-taking strategies: the use of first-language responses 
to foreign-language passages and the use of a response-strategy 
checklist after each response. The first study had 
two Hebrew-speaking respondents, a strong and a weak reader 
respectively, engage in reading comprehension testing tasks 
(Zupnik 1985a). The students read an EFL text and answered five 
questions in English by indicating the line(s) in the text that 
provided an answer to the question, and then answered the same 
questions again, this time providing open-ended answers for the 
questions in Hebrew (as in Zupnik 1985b, mentioned above). 

Both respondents were trained to produce think-aloud and 
self-observational data (15), and were then asked to provide such 
data regarding both language tasks before answering the questions 
in writing. The poor reader was found to use four times as many 
reading strategies on the English response task as did the 



(15) "Think-aloud" data reflect stream-of-consciousness 
disclosure of thought processes while the information is being 
attended to. Such data are basically unedited and unanalyzed. 
"Self-observation," on the other hand, refers to the inspection 
of specific reading behavior, either while the information is 
still in short-term memory, i.e., introspectively, or after the 
event, i.e., retrospectively (usually after 20 seconds or so). 
It does entail analysis and editing of the data to a lesser or 
greater degree. 
strong reader (using Sarig's taxonomy of reading strategies (16)). 
Both readers used a similar number of strategies on the Hebrew 
response task. As to the type of reading strategies used, it was 
found that the better reader used monitoring strategies most of 
all in both languages, while the poorer reader relied mostly on 
clarification and simplification strategies, with very limited 
use of monitoring strategies. Furthermore, most of the 
strategies of the stronger reader were comprehension-promoting, 
while those of the poorer reader were often comprehension-deterring. 
As in the companion group study (Zupnik 1985b), this 
case study confirmed the hypothesis that quoting rhetorically-focused 
foreign-language segments from text encourages more 
superficial reading than answering in the first language. 



(16) On the basis of protocol analysis of high-school students 
reading Hebrew as a first language and English as a foreign 
language, Sarig (1987) designed a taxonomy of "reading move 
types," which includes four broad categories of moves, or 
strategies: technical-aid moves (reading acts undertaken to 
facilitate higher-level moves, e.g., skimming for the purpose 
of determining the macro-frame of the text and notes taken while 
reading), clarification and simplification moves (semantic-decoding 
moves, involving paraphrase to simplify syntax, 
vocabulary, ideas, or rhetorical functions), coherence-detecting 
moves (using textual or extra-textual clues to make the text 
meaningful, e.g., through textual and content schemata, 
rhetorical functions, ideas and views expressed), and monitoring 
moves (conscious strategies for checking on the reading process, 
e.g., awareness of the task being performed, identification of 
misunderstanding and incompatibility of formerly interpreted 
material with newly interpreted material, awareness of other 
failures in comprehension, and awareness of resources for remedy 
and likelihood of success). 
The second piece of innovative research on test taking dealt 
with the refining of a research methodology for tapping test- 
taking strategies. The issue under study was whether it is 
possible to collect introspective and retrospective data from 
students just after they have answered each item on a test. The 
approaches reported on in previous work have involved at most a 
request of respondents after they have finished a subtest or 
group of items that they reflect back as to the strategies that 
they used in arriving at answers to those items (Cohen 1984) . In 
an effort to provide immediate verbal report data, Nevo (1985) 
designed a testing format that would allow for immediate feedback 
after each item. She developed a response-strategy checklist, 
based on the test-taking strategies that have been described in 
the literature and on her intuitions as to strategies respondents 
were likely to select. A pilot study had shown that it was 
difficult to obtain useful feedback on an item-by-item basis 
without a checklist to jog the respondents' memory as to possible 
strategies . 

Nevo's checklist included fifteen strategies, each appearing 
with a brief description and a label meant to promote rapid 
processing of the checklist (see Figure 1). She administered a 
multiple-choice reading comprehension test in Hebrew first- 
language and French foreign-language to forty-two iSth graders, 
and requested that they indicate for each of the ten questions on 
each test, the strategy that was most instrumental in their 
arriving at an answer as well as that which was the second most 
instrumental. The responses were kept anonymous so as to 
encourage the students to report exactly what they did, rather 
than what they thought they were supposed to report. 

It was found that students were able to record the two 
strategies that were most instrumental in obtaining each answer. 
The study indicated that respondents transferred test-taking 
strategies from first language to foreign language. The 
researcher also identified whether the selected strategies aided 
in choosing the correct answer. The selection of strategies that 
did not promote choice of the correct answer was more prevalent 
in the foreign-language test than in the first-language version. 
The main finding in this study was that it was possible to obtain 
feedback from respondents on their strategy use after each item 
on a test if a checklist was provided for quick labeling of the 
processing strategies utilized. 
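
The kind of tabulation that such item-by-item checklist data invite might be sketched as follows. This is a hypothetical illustration only: the record layout, the function name tally_strategies, and the sample records are invented for the purpose and are not taken from Nevo's data or analysis.

```python
from collections import Counter


def tally_strategies(records):
    """For each strategy named most instrumental, return (times used, share of correct answers)."""
    used = Counter()
    correct = Counter()
    for rec in records:
        used[rec["primary"]] += 1
        if rec["correct"]:
            correct[rec["primary"]] += 1
    return {strategy: (n, correct[strategy] / n) for strategy, n in used.items()}


# Invented records: one entry per respondent per item.
records = [
    {"item": 1, "primary": "guessing", "correct": False},
    {"item": 2, "primary": "guessing", "correct": True},
    {"item": 2, "primary": "clues in the text", "correct": True},
]
print(tally_strategies(records))
```

A tabulation of this sort, run separately for the first-language and foreign-language tests, would show both how often each strategy was reported and how often it accompanied a correct answer.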

Furthermore, the respondents reported benefiting greatly from 
the opportunity to become aware of how they took reading tests. 
They reported being basically unaware of their strategies prior 
to this study. (17) 

In terms of the actual strategies used for answering the 
multiple-choice tests in Hebrew as a first language and French as 
a foreign language, Nevo found that "returning to the text to 



(17) What was not looked at were the carryover effects of 
this study on those same respondents the next time that they took 
a reading test. Such research would help to determine whether 
this awareness is only temporary or whether it has a lasting 
effect. 
look for the correct answer after reading the questions" and 
"looking for clues to the answer in the section of text that the 
question referred to" were the two most frequently reported 
strategies, both in first and in foreign language. In foreign 
language, however, respondents were somewhat less likely to 
return to the text in general, probably reflecting the greater 
processing difficulties this involved. The major difference in 
first-language vs. foreign-language test-taking strategies was 
that in first language, "guessing without any particular 
considerations" was rarely utilized, while in foreign-language 
responses, it was reportedly used for 20%-30% of the items on the 
test. Nevo's study pinpointed not only the frequency of 
guessing, but the specific items for which it was reported. (18) 

From these findings and from others, there is emerging a 
description of what respondents do to answer questions. Unless 
trained to do otherwise, they may use the most expedient means of 
responding available to them, such as relying more on their 
previous experience with seemingly similar formats than on a 
close reading of the description of the task at hand. Thus, when 
given a passage to read and summarize, they may well perform the 
task the same way they did the last summary task, rather than 
paying close attention to what is called for in the current one. 



(18) This study made a dichotomy between guessing without any 
particular considerations and not guessing. In reality, there is 
a continuum from guessing without considerations to thoughtful 
guessing to non-guessing. 
Often, this strategy works, but on occasion the particular task 
may require subtle or major shifts in response behavior in order 
to perform well. 

There appears to be a further insight to be gained from the 
test strategy literature, namely, that indirect testing formats, 
i.e., those which do not reflect real-world tasks (e.g., 
multiple-choice, cloze, etc.), may prompt the use of strategies 
solely for the purpose of coping with the test format. More 
direct formats such as summarizing a text may be free of such 
added testing effects. However, as long as the task is part of 
a test, students are bound to use strategies they would not use 
under non-test conditions. It is largely the responsibility of 
test constructors and of those who administer such tests to be 
aware of what their tests are actually measuring. Verbal report 
techniques can assist the test developer and user in obtaining 
such information. 

Insights about the way in which respondents go about 
performing different testing tasks can be used to make informed 
decisions as to: (1) the choice of testing format, (2) the 
choice and wording of instructions, and (3) the value and 
feasibility of coaching the respondents in how to take language 
tests. Work by O'Malley (1986) and others has already made use 
of research findings in designing training modules for the 
learning of test-taking skills. 

Conclusions 
This chapter has not attempted to survey the whole field of 
testing as it relates to linguistic and communicative 
proficiency. Rather, it has touched on some of the issues 
regarding the testing of reading comprehension that have been of 
major concern to test developers, test users, and test takers 
during recent years. Reconsideration of the purposes for tests 
and of how to combine purposes has been a key interest in this 
chapter, as have questions of test validation. Sometimes careful 
attention is given to innovation in testing method, whether 
through cloze, C-testing, or computerized adaptive 
testing, without commensurate attention paid to the type of 
reading being called for, the level of comprehension desired, and 
the comprehension skills to be elicited. For this reason, 
attention was given to these factors here. 

During this period of awakened interest in learners' 
processing of language, it seems fitting that we should pay extra 
attention to the actual strategies being used in test taking. 
There is no doubt that test constructors and test users can 
receive beneficial feedback from inquiries into what the given 
tests actually prompt respondents to do beyond their 
expectations or assumptions. As for test takers, they are 
sometimes if not frequently oblivious to how they are answering 
test items, possibly to their detriment. It is possible that 
they would become more effective at taking tests if they were 
informed as to what they are doing at present and as to what they 
could be doing that would yield better results. 
Figure 1 

Strategies for Answering Multiple-Choice Reading 
Comprehension Questions (From Nevo 1985) 

1. Background knowledge: general knowledge outside the text. 

2. Guessing: guessing without any particular considerations. 

3. Returning to the passage: returning to the text to look for 
the correct answer, after reading the questions and multiple-choice 
alternatives. 

4. Chronological order: looking for the answer in chronological 
order in the passage. 

5. Clues in the text: locating the area in the text that the 
question referred to and then looking for clues to the answer in 
that context. 

6. Ceasing search at plausible choice: reading the alternative 
choices until reaching one that was thought to be correct. Not 
continuing to read the rest of the choices. 

7. Process of elimination: selecting an alternative not because 
it was thought to be correct but because the others did not seem 
reasonable, seemed similar, or were not understandable. 

8. Choosing the exception: suspecting a choice to be the correct 
answer because it constituted an exception or had something 
different about it. 

9. Length: being drawn to an alternative because it was 
longer/shorter. 

10. Location: the location of the alternative 
within the set of alternatives. 

11. Common word: choosing an alternative because it had in it a 
word that was common, that was heard all the time. 

12. Key word: arriving at an alternative because it had in it a 
word that appeared to be a key word. 

13. Matching the stem with an alternative: selecting an 
alternative because it had in it a word/words that appeared in 
the item stem as well. 

14. Association: selecting the alternative because it had a word 
in it that evoked an association with a word in the first 
language or in another language. 

15. Matching the question with the text: selecting an alternative 
because it had a word/words that also appeared in the text, 
because it had words similar in sound, meaning, or belonged to 
the same word family, or because it just seemed to be related. 

16. Other strategy 